The BIOfid portal is aware of the semantic context of a term. Here, semantic context means that the portal "knows" how often a term is mentioned in the documents together with other terms.
For example, you would expect the term "Fagus" (i.e. beeches) to occur very often in documents that mention the term "plants". This concept can be extended: when you search for "Fagus" (or the BIOfid-URI for "Fagus"), you retrieve not only the BIOfid-URI (and its label) for "plants", but also other taxa (both plants and animals) that are often mentioned together with "Fagus" in the texts.
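The idea of counting how often terms are mentioned together can be illustrated with a minimal sketch. It is not the portal's implementation: the toy corpus and the function name `cooccurrence_counts` are assumptions made up for this example, and each document is reduced to the set of terms it mentions.

```python
from collections import Counter


def cooccurrence_counts(documents: list[set[str]], term: str) -> Counter:
    """Count, for every other term, the documents that mention it together with `term`."""
    counts = Counter()
    for terms_in_document in documents:
        if term in terms_in_document:
            # Every other term in this document co-occurs once with `term`
            counts.update(terms_in_document - {term})
    return counts


# Toy corpus (illustrative only, not BIOfid data)
toy_documents = [
    {"Fagus", "plants", "Quercus"},
    {"Fagus", "plants"},
    {"Quercus", "plants"},
    {"Fagus", "Sciurus vulgaris"},
]

fagus_context = cooccurrence_counts(toy_documents, "Fagus")
print(fagus_context["plants"])   # documents mentioning both "Fagus" and "plants"
print(fagus_context["Quercus"])  # documents mentioning both "Fagus" and "Quercus"
```

In this toy corpus, "plants" co-occurs with "Fagus" in two documents, while "Quercus" and "Sciurus vulgaris" co-occur with it in one document each.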
The BIOfid API getTermContext allows you to retrieve the most common URIs that
are associated with a given term. For example, to get all associated terms in the
BIOfid corpus for "Fagus" (which has the BIOfid-URI https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875),
you can put this URL into a script or your browser's address bar:
https://www.biofid.de/api/v1/getTermContext?term=https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875
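For use in a script, the request URL can be assembled programmatically. The sketch below uses only the Python standard library; the helper name `build_term_context_url` is hypothetical, and it assumes that the API accepts a percent-encoded `term` value like any standard query parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://www.biofid.de/api/v1/getTermContext"


def build_term_context_url(term: str) -> str:
    """Return the getTermContext request URL for a term or URI."""
    # urlencode percent-encodes the term (assumption: the API decodes it as usual)
    return f"{API_ENDPOINT}?{urlencode({'term': term})}"


fagus_uri = "https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875"
request_url = build_term_context_url(fagus_uri)
print(request_url)

# Decoding the query string recovers the original URI
decoded_term = parse_qs(urlsplit(request_url).query)["term"][0]
```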
This returns a large amount of data. Here is an excerpt showing what the data can look like:
...
{
    "uri": "https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875",
    "count": 1244, // The number of articles where the given term and the "uri" are mentioned together
    "type": "Taxon", // The type of the given "uri", currently either "Taxon" or "Location"
    "documents": [0,1,2,3,4,5,6,...., 1243], // A reference to the document URL index in the same dataset.
    "label": "Fagus L.", // The label for the given URI, if available
    "sameAs": ["https://www.gbif.org/species/2874875"] // SameAs-relationship to other infrastructures.
}
...
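Entries of this shape can be filtered by "type" and ranked by "count". The following is a minimal sketch under stated assumptions: the function name `top_associations` is hypothetical, and the sample entries are placeholders that only mirror the fields shown above, not real BIOfid data.

```python
def top_associations(results: list[dict], term_type: str, limit: int = 5) -> list[tuple[str, int]]:
    """Return (label, count) pairs of the given type, most frequent first."""
    matching = [entry for entry in results if entry.get("type") == term_type]
    matching.sort(key=lambda entry: entry["count"], reverse=True)
    # Fall back to the URI when no label is available
    return [(entry.get("label") or entry["uri"], entry["count"])
            for entry in matching[:limit]]


# Placeholder entries (illustrative only)
sample_results = [
    {"uri": "https://example.org/taxon/a", "count": 12, "type": "Taxon", "label": "Taxon A"},
    {"uri": "https://example.org/location/x", "count": 30, "type": "Location", "label": "Location X"},
    {"uri": "https://example.org/taxon/b", "count": 40, "type": "Taxon", "label": None},
    {"uri": "https://example.org/taxon/c", "count": 7, "type": "Taxon", "label": "Taxon C"},
]

top_taxa = top_associations(sample_results, "Taxon", limit=2)
print(top_taxa)  # [('https://example.org/taxon/b', 40), ('Taxon A', 12)]
```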
Below you find an example of how this data can be evaluated, using the moth family
Zygaenidae. You should be able to change TERM_STRING to
a BIOfid-URI, a Wikidata-URI, or a literal term and run the script again.
For the figure, the script explicitly filters for species (so no genus or family data is considered).
This is the script used to generate a figure for a TDWG presentation:
Pachzelt A, Kasperek G, Lücking A, Abrami G, Driller C (2021) Semantic Search in Legacy Biodiversity Literature: Integrating data from different data infrastructures. Biodiversity Information Science and Standards 5: e74251. https://doi.org/10.3897/biss.5.74251
from scripts.commons import Biofid
import json
from copy import copy
TERM_STRING = 'https://www.biofid.de/bio-ontologies/Lepidoptera/gbif/8875' # Zygaenidae
biofid = Biofid()
term_data = biofid.get_term_context(TERM_STRING)
show_data = copy(term_data['results'][0])
show_data.pop('documents') # This is a large numeric list
print(json.dumps(show_data, indent=4))
{
    "uri": "https://www.biofid.de/bio-ontologies/Lepidoptera/gbif/8875",
    "count": 27,
    "type": "Taxon",
    "label": "Zygaenidae",
    "sameAs": [
        "https://www.gbif.org/species/8875"
    ]
}
import pandas as pd
def is_species(uri: str) -> bool:
    """Return True if the BIOfid data for the URI includes the species rank."""
    try:
        taxon_data = biofid.get_biofid_data_for_uri(uri)
        return any(
            row['object']['value'] == 'https://www.biofid.de/bio-ontologies#Rank_Species'
            for row in taxon_data['data'])
    except (IndexError, ConnectionError):
        return False
associated_tracheophyta = []
for term in term_data['results']:
    if term['type'] == 'Taxon':
        if 'Tracheophyta' in term['uri'] and is_species(term['uri']):
            # Quick preview only: df is overwritten on every match and is not used
            # for the figure. Iterating over term_data yields its top-level keys,
            # so the same row is repeated once per key.
            df = pd.concat([pd.DataFrame.from_dict({'uri': term['uri'], 'label': term['label'], 'count': [term['count']]})
                            for d in term_data], ignore_index=True)
            associated_tracheophyta.append((term['uri'], term['label'], term['count']))
print(df.head())
                                                 uri                  label  count
0  https://www.biofid.de/bio-ontologies/Tracheoph...  Urtica dioica Fischer      3
1  https://www.biofid.de/bio-ontologies/Tracheoph...  Urtica dioica Fischer      3
2  https://www.biofid.de/bio-ontologies/Tracheoph...  Urtica dioica Fischer      3
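The preview above shows the same term in every row. To tabulate all collected associations instead, with one row per term, a list of (uri, label, count) tuples such as associated_tracheophyta can be passed to pandas directly. The sketch below uses placeholder tuples, which are assumptions for illustration and not BIOfid results.

```python
import pandas as pd

# Placeholder (uri, label, count) tuples, shaped like the entries of associated_tracheophyta
sample_associations = [
    ("https://example.org/taxon/a", "Taxon A", 3),
    ("https://example.org/taxon/b", "Taxon B", 9),
    ("https://example.org/taxon/c", "Taxon C", 5),
]

# One row per tuple, most frequently co-mentioned term first
associations_df = (
    pd.DataFrame(sample_associations, columns=["uri", "label", "count"])
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
print(associations_df)
```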
import plotly.graph_objects as go
def filter_data_by_article_count(dataset: list, min_article_count: int) -> list:
    """Keep only (uri, label, count) tuples with at least min_article_count articles."""
    return [entry for entry in dataset if entry[2] >= min_article_count]

def generate_range_data(dataset, label=None, start: int = 0) -> list:
    """Return one value per dataset entry: the fixed label if given, else a running index from start."""
    return [label if label is not None else i
            for i in range(start, start + len(dataset))]

MIN_TAXON_ARTICLE_COUNT = 6
MIN_LOCATION_ARTICLE_COUNT = 3
associated_locations = []
for term in term_data['results']:
    if term['type'] == 'Location' and term['label']:
        associated_locations.append((term['uri'], term['label'], term['count']))
associated_tracheophyta_with_min_count = filter_data_by_article_count(
    associated_tracheophyta, min_article_count=MIN_TAXON_ARTICLE_COUNT)
associated_locations_with_min_count = filter_data_by_article_count(
    associated_locations, min_article_count=MIN_LOCATION_ARTICLE_COUNT)
# One source per link: taxon links start at node 1 ('Tracheophyta'),
# location links start at node 2 ('Locations')
taxon_group_target = generate_range_data(
    associated_tracheophyta_with_min_count, label=1)
location_group_target = generate_range_data(
    associated_locations_with_min_count, label=2)
# Include the original term as source (Index 0)
merged_sources = [0, 0]
merged_sources.extend(taxon_group_target)
merged_sources.extend(location_group_target)
# Target indices follow the node label order: 0 = term, 1 = 'Tracheophyta',
# 2 = 'Locations', then the taxa (from index 3), then the locations
merged_targets = [1, 2]
merged_targets.extend(
    generate_range_data(associated_tracheophyta_with_min_count, start=3)
)
start_count_locations = 3 + len(taxon_group_target)
merged_targets.extend(
    generate_range_data(associated_locations_with_min_count, start=start_count_locations)
)
labels = [
    term_data['results'][0]['label'],
    'Tracheophyta',
    'Locations'
]
labels.extend(term[1] for term in associated_tracheophyta_with_min_count)
labels.extend(term[1] for term in associated_locations_with_min_count)
taxon_values = [term[2] for term in associated_tracheophyta_with_min_count]
location_values = [term[2] for term in associated_locations_with_min_count]
merged_values = [
    sum(taxon_values) + 5,
    sum(location_values) - 5
]
merged_values.extend(taxon_values)
merged_values.extend(location_values)
x_values = [None for i in range(0, len(labels) + 1)]
y_values = [None for i in range(0, len(labels) + 1)]
y_values[2] = 0.6
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        label=labels,
        x=[0, 0, 0.5],
        y=[0, 0, 0.8]
    ),
    link=dict(
        source=merged_sources,
        target=merged_targets,
        value=merged_values
    ))])
fig.write_image('term-associations.png', scale=3)
fig.show()
The generated figure illustrates that the Zygaenidae are associated in the BIOfid corpus with indicator species for calcareous grassland. Since Zygaenidae have a preference for this ecosystem, this is not surprising. However, the BIOfid portal established this relation only by indexing documents, without any machine learning involved.
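A Sankey diagram of this kind needs one source index, one target index, and one value per link, with all indices pointing into the list of node labels. The following sketch shows that bookkeeping in isolation, without plotly. The function name `build_two_level_links` and the sample groups are assumptions for illustration, not part of the script above or of the BIOfid data.

```python
def build_two_level_links(groups: dict[str, list[tuple[str, int]]], root_label: str):
    """Return (labels, sources, targets, values) for root -> group -> member links."""
    group_names = list(groups)
    labels = [root_label] + group_names
    sources, targets, values = [], [], []
    # Links from the root (node 0) to each group node (1..len(groups))
    for group_index, name in enumerate(group_names, start=1):
        sources.append(0)
        targets.append(group_index)
        values.append(sum(count for _, count in groups[name]))
    # Links from each group node to its members; each member gets the next free node index
    for group_index, name in enumerate(group_names, start=1):
        for member_label, count in groups[name]:
            labels.append(member_label)
            sources.append(group_index)
            targets.append(len(labels) - 1)
            values.append(count)
    return labels, sources, targets, values


# Placeholder groups (illustrative only)
demo_labels, demo_sources, demo_targets, demo_values = build_two_level_links(
    {"Tracheophyta": [("Taxon A", 8), ("Taxon B", 6)], "Locations": [("Location X", 4)]},
    root_label="Zygaenidae",
)
print(demo_labels)
print(list(zip(demo_sources, demo_targets, demo_values)))
```

The three link lists always have the same length, and every target index refers to an existing label, which are the two properties worth checking before passing such lists to a Sankey plot.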